# Introduction
This kernel will discuss model selection and regularization techniques using data on Boston-area neighborhoods. The data will be used to determine the relationship between crime rate and the other variables in the Boston data set.
First, the data will be looked at in an exploratory manner. Then, a manual model selection will be performed. Afterwards, an automatic model search using best subset selection, the lasso, and ridge regression will be done, and the results from each will be discussed. Finally, a model that performs well on the data set will be selected, and the reasoning behind that choice will be justified.
This paper will be looking at the Boston data frame from the MASS library in R. The data frame contains housing, neighborhood, and location-related information about Boston suburbs. The data frame has 506 rows and 14 columns/variables. With that being said, here is a brief description of the variables:
# Make Variable and Description Table
rm(list=ls()) # Clear the workspace
graphics.off() # Clear graphics
tableCounter = 1
figCounter = 0
library(knitr) # datatables
Variable = c("crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "black", "lstat", "medv")
Description = c("per capita crime rate by town",
"proportion of residential land zoned for lots over 25,000 sq.ft.",
"proportion of non-retail business acres per town.",
"Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).",
"nitrogen oxides concentration (parts per 10 million).",
"average number of rooms per dwelling.",
"proportion of owner-occupied units built prior to 1940.",
"weighted mean of distances to five Boston employment centers.",
"index of accessibility to radial highways.",
"full-value property-tax rate per $10,000.",
"pupil-teacher ratio by town.",
"1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.",
"lower status of the population (percent).",
"median value of owner-occupied homes in $1000s.")
df = data.frame(Variable, Description)
kable(df, format = 'markdown')
| Variable | Description |
|---|---|
| crim | per capita crime rate by town |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | proportion of non-retail business acres per town. |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). |
| nox | nitrogen oxides concentration (parts per 10 million). |
| rm | average number of rooms per dwelling. |
| age | proportion of owner-occupied units built prior to 1940. |
| dis | weighted mean of distances to five Boston employment centers. |
| rad | index of accessibility to radial highways. |
| tax | full-value property-tax rate per $10,000. |
| ptratio | pupil-teacher ratio by town. |
| black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. |
| lstat | lower status of the population (percent). |
| medv | median value of owner-occupied homes in $1000s. |
library(MASS) # Data source
library(GGally) # ggpairs plot
library(ggplot2) # Other plotting
library(leaps) # regsubsets
library(glmnet) # Ridge Regression and Lasso
library(plotmo) # Plotting glmnet
library(car) # vif among other things
library(knitr)
library(htmltools) # Table Header Editing
data(Boston) # load data
# Convert Appropriate variables to categorical (factor)
Boston$chas = as.factor(Boston$chas)
#Boston$rad = as.factor(Boston$rad) # fine as numeric according to professor comment in discussion module
# Read in and prep the data
if (sum(is.na(Boston)) > 0) { # remove any rows with missing values
  Boston = na.omit(Boston)
}
# Set global option for significant digits in output
options(digits=3)
As mentioned in the introduction, the Boston data will be explored first. Then, a manual model search will be done to determine the relationship between crime rate (crim) and the other variables in the data set. Afterwards, a more automatic model search will be conducted using three separate methods: best subset selection, ridge regression, and the lasso. These methods will be compared to see which predicts crim best, as measured by test mean squared error. Once the model search is complete, a final model will be selected along with the justification for that model.
Let us take a look at the summary of the Boston data frame that is produced by R.
summary(Boston)
## crim zn indus chas nox
## Min. : 0.0 Min. : 0.0 Min. : 0.46 0:471 Min. :0.385
## 1st Qu.: 0.1 1st Qu.: 0.0 1st Qu.: 5.19 1: 35 1st Qu.:0.449
## Median : 0.3 Median : 0.0 Median : 9.69 Median :0.538
## Mean : 3.6 Mean : 11.4 Mean :11.14 Mean :0.555
## 3rd Qu.: 3.7 3rd Qu.: 12.5 3rd Qu.:18.10 3rd Qu.:0.624
## Max. :89.0 Max. :100.0 Max. :27.74 Max. :0.871
## rm age dis rad tax
## Min. :3.56 Min. : 2.9 Min. : 1.13 Min. : 1.00 Min. :187
## 1st Qu.:5.89 1st Qu.: 45.0 1st Qu.: 2.10 1st Qu.: 4.00 1st Qu.:279
## Median :6.21 Median : 77.5 Median : 3.21 Median : 5.00 Median :330
## Mean :6.28 Mean : 68.6 Mean : 3.80 Mean : 9.55 Mean :408
## 3rd Qu.:6.62 3rd Qu.: 94.1 3rd Qu.: 5.19 3rd Qu.:24.00 3rd Qu.:666
## Max. :8.78 Max. :100.0 Max. :12.13 Max. :24.00 Max. :711
## ptratio black lstat medv
## Min. :12.6 Min. : 0 Min. : 1.7 Min. : 5.0
## 1st Qu.:17.4 1st Qu.:375 1st Qu.: 7.0 1st Qu.:17.0
## Median :19.1 Median :391 Median :11.4 Median :21.2
## Mean :18.5 Mean :357 Mean :12.7 Mean :22.5
## 3rd Qu.:20.2 3rd Qu.:396 3rd Qu.:17.0 3rd Qu.:25.0
## Max. :22.0 Max. :397 Max. :38.0 Max. :50.0
Looking at chas, a categorical variable in the data set, a large majority (>90%) of the observations are 0, which indicates most tracts do not bound the Charles River.
The variable we are trying to predict, crim, is heavily concentrated in a small range. One can see the minimum at 0, the median at 0.3, and the 3rd quartile at 3.7, with a maximum of 89.0. This means half of the crim distribution lies between 0 and 0.3, the next 25% of the values lie between 0.3 and 3.7, and the last 25% stretch from 3.7 to 89.0. Some of the other variables, like zn and black, exhibit similarly skewed distributions.
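The skew described above can be checked directly with a couple of base R one-liners (a quick sketch, assuming the MASS package is available and the raw Boston data frame is loaded):

```r
library(MASS)  # provides the Boston data frame
data(Boston)

# Share of tracts with chas == 0 (not bounding the Charles River)
mean(Boston$chas == 0)  # about 0.93

# Quartiles of crim, showing the heavy concentration near zero
quantile(Boston$crim, probs = c(0, 0.25, 0.5, 0.75, 1))
```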
This can be more easily visualized in the following figure that also shows how the variables in the data are related to each other.
p = ggpairs(Boston, aes(alpha=0.1),
diag = list(continuous = "barDiag"),
lower = list(continuous = "smooth"))
suppressMessages(print(p))
From Figure 1, one can see various aspects of the data. As mentioned before, crim, zn, and black each concentrate a large number of observations at one end of their range, which leaves the distributions heavy on one side.
Looking at the top row, which shows how the variable of interest, crim, relates to the other continuous variables via correlation coefficients, rad and tax have the two largest magnitudes at 0.626 and 0.583, respectively. A majority of the other crim correlation coefficients fall between 0.2 and 0.45 in magnitude, so rad and tax stand out as relatively large.
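These correlations can also be recomputed directly; a quick sketch (using the raw numeric Boston data frame, i.e. before chas is converted to a factor):

```r
library(MASS)  # provides the Boston data frame
data(Boston)   # raw version: all columns numeric

# Correlation of crim with every other variable, sorted by magnitude;
# rad and tax should top the list
crim.cors = cor(Boston)["crim", -1]
round(sort(abs(crim.cors), decreasing = TRUE), 3)
```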
The correlation can also be seen visually in the bottom left portion of the figure with the scatter plots and trend lines.
For the crim scatter plots, one can see a large number of these scatter plots have points that seem to hug the y-axis. This is partly due to the fact that a majority of the values for crim fall near 0 as was mentioned earlier.
Since a large majority of crim values fall near zero, the values greater than 10 will have much more influence. For this reason, and for another that will be discussed when fitting the full linear model, a log transformation of crim helps alleviate this situation and makes modeling easier.
Below, we transform crim by applying the log function to all crim values and then remake a figure similar to Figure 1.
BostonCrimLog = Boston
BostonCrimLog$crim = log(Boston$crim)
p = ggpairs(BostonCrimLog, aes(alpha=0.1),
diag = list(continuous = "barDiag"),
lower = list(continuous = "smooth"))
suppressMessages(print(p))
Looking at the crim histogram, one can now see a distribution that is closer to normal. It may be bimodal; however, unlike before, a few extreme points no longer exert undue influence. Also, all the correlation coefficients in the first row have increased in magnitude, from roughly the (0.2, 0.65) range to (0.4, 0.85). This points to the variables having a stronger relationship with crim after the log transformation.
Rad and tax still have the largest correlation coefficients, as was stated before. Taking a closer look at tax and rad, one can see they are highly correlated (>0.90) with each other. This may show up as multicollinearity if both of these variables are included in the model. This will be looked at in more detail later in the analysis.
From the scatter plot for rad and tax, one can see all the higher values of rad have similar higher values for tax. This results in a ‘clump’ of data seen in the top right part of the plot. This also helps explain a high correlation between the two variables.
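As a quick numerical check of that claim (a sketch, using the untransformed MASS data):

```r
library(MASS)  # provides the Boston data frame
data(Boston)

# rad and tax are nearly collinear: the high-rad tracts share similarly
# high tax rates, producing the 'clump' visible in the scatter plot
cor(Boston$rad, Boston$tax)  # about 0.91
```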
With the data explored, let’s look into how a model can be selected to determine the relationship between crime rate (crim) and the other variables in the data set.
In this section, we will explore the full model initially. Afterwards, variables that are deemed problematic or insignificant will be manually removed and a final model will be looked at.
From my current knowledge of crime rates, I know that poorer areas tend to have higher crime rates. For this paper, that means median house value (medv) and crime rate (crim) should have an inverse relationship, and hopefully this proves to be true. I do not have enough knowledge to make a logical association or decision about the other variables.
First, let’s look at a model containing all the other variables in the model (aside from crim) to determine the per capita crime rate by town (crim).
We will begin by checking some of the linear assumptions via plots.
par(mfrow=c(2,2))
plot(lm(crim ~ ., data=Boston))
One can see that large fitted values present issues by having large residuals. Also, the residual variance of the larger fitted values is larger than the lower fitted values. This violates the constant variance assumption and makes it hard to tell the pattern of the residual trend line near zero. One method to remedy this is to transform the variable being predicted. In this case, crim is the variable being predicted, so we will transform crim by applying a log function to all the crim values. This was mentioned earlier in the exploratory data analysis section. It should also be noted that the normality plot points to having a heavy tail, which a log transformation can also help remedy.
After applying a log transformation to crim, we get the following plots.
par(mfrow=c(2,2))
plot(lm(log(crim) ~ ., data=Boston))
From Figure 4, the constant variance assumption is no longer violated as the spread of the residuals remains relatively constant throughout the residuals vs fitted values plot. Also, linearity seems satisfied as the mean of the residuals in the same plot stay near zero as one moves left to right on the plot. From the top right plot, normality of the residuals is much better as almost all the points are close to the ideal normality line. In the bottom right plot, none of the points have a large Cook’s distance, which helps indicate none of the points have unneeded influence in the model building.
Overall, the linear model assumptions of linearity, normality, and constant variance do not seem to be violated from these plots.
From this point on, the crim variable will be log transformed.
We can go over some of the results of the full linear model in the following table
# Basic linear model (full data set)
Boston$crim = log(Boston$crim) # log transform crim.
crim.lm = lm(crim ~ ., data=Boston)
crim.lm.summary = summary(crim.lm)
crim.lm.coef = t(t(crim.lm.summary$coefficients)) # Save off results
# look at VIF to see if any multicollinearity exists
crim.lm.vif = t(t(vif(crim.lm)))
colnames(crim.lm.vif) = "VIF"
rownames(crim.lm.vif)[3] = "chas1"
fullLMresults = merge(crim.lm.vif,crim.lm.coef, by = 0, all = TRUE)
rownames(fullLMresults) = fullLMresults$Row.names
fullLMresults = fullLMresults[-1]
kable(fullLMresults, format = 'markdown')
| | VIF | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|---|
| (Intercept) | NA | -3.733 | 0.869 | -4.297 | 0.000 |
| age | 3.10 | 0.006 | 0.002 | 2.782 | 0.006 |
| black | 1.37 | -0.002 | 0.000 | -3.404 | 0.001 |
| chas1 | 1.09 | -0.048 | 0.142 | -0.339 | 0.734 |
| dis | 4.29 | -0.005 | 0.034 | -0.161 | 0.872 |
| indus | 3.99 | 0.020 | 0.010 | 1.973 | 0.049 |
| lstat | 3.56 | 0.032 | 0.009 | 3.487 | 0.001 |
| medv | 3.77 | 0.010 | 0.007 | 1.441 | 0.150 |
| nox | 4.55 | 3.847 | 0.633 | 6.072 | 0.000 |
| ptratio | 1.98 | -0.041 | 0.022 | -1.837 | 0.067 |
| rad | 7.16 | 0.143 | 0.011 | 13.511 | 0.000 |
| rm | 2.26 | -0.049 | 0.074 | -0.666 | 0.506 |
| tax | 9.20 | 0.000 | 0.001 | -0.213 | 0.831 |
| zn | 2.33 | -0.012 | 0.002 | -5.194 | 0.000 |
Looking at the VIF values, one can see rad and tax have VIF values of 7.16 and 9.20, respectively. These are greater than 5, which can indicate multicollinearity.
To remedy this issue, we will remove one of the two variables.
The variable to remove is the one with the higher p-value. Tax has a p-value of 0.831, which is greater than a 0.05 significance level. This tells us tax does not need to be included in the model, especially considering its VIF value and how correlated it is with rad. Rad, on the other hand, has a p-value of 1.32 × 10^-35, which indicates it should be included in the model.
The full linear model has an R-squared value of 0.875 and adjusted R-squared value of 0.872. The next model will attempt to reduce multicollinearity by removing tax from the predictors.
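For reference, a VIF can be reproduced by hand via the identity VIF_j = 1/(1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors. A sketch for tax (treating chas as numeric for simplicity, so the value may differ slightly from the table):

```r
library(MASS)  # provides the Boston data frame
data(Boston)

# R^2 from regressing tax on all the other predictors (response crim excluded)
r2.tax = summary(lm(tax ~ . - crim, data = Boston))$r.squared
vif.tax = 1 / (1 - r2.tax)
vif.tax  # roughly 9, consistent with the multicollinearity flagged above
```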
As was previously discussed, the variable tax will be removed from the model. The following table presents some of the results of the model, including VIF values.
# Basic linear model (without tax)
crim.lmNoTax = lm(crim ~ . - tax, data=Boston)
crim.lmNoTax.summary = summary(crim.lmNoTax)
crim.lmNoTax.coef = t(t(crim.lmNoTax.summary$coefficients)) # Save off results
# look at VIF to see if any multicollinearity exists
crim.lmNoTax.vif = t(t(vif(crim.lmNoTax)))
colnames(crim.lmNoTax.vif) = "VIF"
rownames(crim.lmNoTax.vif)[3] = "chas1"
mclFreeResults = merge(crim.lmNoTax.vif,crim.lmNoTax.coef, by = 0, all = TRUE)
rownames(mclFreeResults) = mclFreeResults$Row.names
mclFreeResults = mclFreeResults[-1]
kable(mclFreeResults, format = 'markdown')
| | VIF | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|---|
| (Intercept) | NA | -3.760 | 0.859 | -4.378 | 0.000 |
| age | 3.10 | 0.006 | 0.002 | 2.780 | 0.006 |
| black | 1.37 | -0.002 | 0.000 | -3.407 | 0.001 |
| chas1 | 1.08 | -0.045 | 0.141 | -0.320 | 0.749 |
| dis | 4.29 | -0.005 | 0.034 | -0.156 | 0.876 |
| indus | 3.23 | 0.019 | 0.009 | 2.091 | 0.037 |
| lstat | 3.54 | 0.032 | 0.009 | 3.518 | 0.000 |
| medv | 3.70 | 0.011 | 0.007 | 1.488 | 0.137 |
| nox | 4.54 | 3.841 | 0.632 | 6.075 | 0.000 |
| ptratio | 1.98 | -0.041 | 0.022 | -1.846 | 0.066 |
| rad | 2.36 | 0.141 | 0.006 | 23.244 | 0.000 |
| rm | 2.26 | -0.049 | 0.074 | -0.664 | 0.507 |
| zn | 2.19 | -0.012 | 0.002 | -5.406 | 0.000 |
Looking at the VIF values, one can see that all values are under 5, although a few, like nox and dis, are close to 5. For now, we will take note and keep an eye on the VIF values in the following models to be cautious. There are also quite a few variables with large p-values. This will be addressed in the next section.
This multicollinearity-free linear model has an R-squared value of 0.875 and adjusted R-squared value of 0.872.
The next step will be to investigate and potentially remove variables that do not need to be included, using a significance level of 0.05. Looking at Table 3, these are the following variables: chas1, dis, medv, ptratio, and rm. After removing these variables, we will refit the model to check whether their removal affects the p-values of any of the remaining variables at the 0.05 significance level. The following table presents some of the results of the model, including VIF values.
# Basic linear model
crim.lm.test = lm(crim ~ . - tax-chas-dis-rm-medv-ptratio, data=Boston)
crim.lm.final.summary = summary(crim.lm.test)
crim.lm.final.coef = t(t(crim.lm.final.summary$coefficients)) # Save off results
# look at VIF to see if any multicollinearity exists
crim.lm.final.vif = t(t(vif(crim.lm.test)))
colnames(crim.lm.final.vif) = "VIF"
manualFinalModelResults = merge(crim.lm.final.vif,crim.lm.final.coef, by = 0, all = TRUE)
rownames(manualFinalModelResults) = manualFinalModelResults$Row.names
manualFinalModelResults = manualFinalModelResults[-1] # get rid of row.names column
kable(manualFinalModelResults, format = 'markdown')
| | VIF | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|---|
| (Intercept) | NA | -4.733 | 0.307 | -15.43 | 0.000 |
| age | 2.63 | 0.006 | 0.002 | 3.19 | 0.002 |
| black | 1.31 | -0.001 | 0.000 | -3.33 | 0.001 |
| indus | 2.91 | 0.015 | 0.009 | 1.78 | 0.075 |
| lstat | 1.90 | 0.023 | 0.007 | 3.41 | 0.001 |
| nox | 3.45 | 4.317 | 0.554 | 7.80 | 0.000 |
| rad | 1.87 | 0.136 | 0.005 | 25.04 | 0.000 |
| zn | 1.60 | -0.010 | 0.002 | -5.56 | 0.000 |
Table 4 shows indus no longer needs to be included in the model at a 0.05 significance level. We will remove indus and refit the model.
After removing tax, chas, indus, dis, medv, ptratio, and rm, the final remaining predictors of age, black, lstat, nox, rad, and zn were used to determine crim. They were all deemed significant at a 0.05 significance level. Some of the results can be seen in the table below.
# Basic linear model (with all variables that must be included)
crim.lm.final = lm(crim ~ . - tax-chas-dis-rm-medv-ptratio-indus, data=Boston)
crim.lm.final.summary = summary(crim.lm.final)
crim.lm.final.coef = t(t(crim.lm.final.summary$coefficients)) # Save off results
# look at VIF to see if any multicollinearity exists
crim.lm.final.vif = t(t(vif(crim.lm.final)))
colnames(crim.lm.final.vif) = "VIF"
manualFinalModelResults = merge(crim.lm.final.vif,crim.lm.final.coef, by = 0, all = TRUE)
rownames(manualFinalModelResults) = manualFinalModelResults$Row.names
manualFinalModelResults = manualFinalModelResults[-1] # get rid of row.names column
kable(manualFinalModelResults, format = 'markdown')
| | VIF | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|---|
| (Intercept) | NA | -4.833 | 0.302 | -15.99 | 0.000 |
| age | 2.62 | 0.007 | 0.002 | 3.31 | 0.001 |
| black | 1.31 | -0.001 | 0.000 | -3.36 | 0.001 |
| lstat | 1.82 | 0.025 | 0.007 | 3.83 | 0.000 |
| nox | 2.89 | 4.714 | 0.508 | 9.28 | 0.000 |
| rad | 1.79 | 0.138 | 0.005 | 25.93 | 0.000 |
| zn | 1.54 | -0.011 | 0.002 | -6.04 | 0.000 |
Table 5 shows all remaining variables need to be included at a 0.05 significance level. Looking at the VIF values, they all fall below 3, better than in the previous models; the model reduction helped reduce the VIF values. As mentioned earlier, my expectation that poorer areas have higher crime rates is not supported by the final model, as medv was not deemed significant enough to be kept in the model.
We will also check linear model assumptions as we did with the initial model.
par(mfrow=c(2,2))
plot(crim.lm.final)
Figure 5 shows similar patterns as in the initial model (with the log transformation of crim), so it does not seem to violate the linear model assumptions of linearity, normality, and constant variance.
The following figure helps check the assumption of independent errors.
par(mfrow=c(1,1))
plot(crim.lm.final.summary$residuals, main = 'Residuals vs Index', ylab='Residuals')
lines(crim.lm.final.summary$residuals)
Figure 6 shows no obvious serial pattern, so the independent error assumption does not seem to be violated.
With that being said, this final manual-method model has an R-squared value of 0.872 and adjusted R-squared value of 0.871. This is only a minor drop from the full linear model, which had an R-squared value of 0.875 and adjusted R-squared value of 0.872; however, we decreased the number of predictors by more than half. This makes the model simpler and easier to interpret.
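The claim that little fit is lost can also be tested formally with a partial F-test comparing the reduced model to the full one (a sketch; variable names follow the text, with chas kept numeric for simplicity):

```r
library(MASS)  # provides the Boston data frame
data(Boston)
Boston$crim = log(Boston$crim)  # as in the text, crim is log transformed

full    = lm(crim ~ ., data = Boston)
reduced = lm(crim ~ . - tax - chas - dis - rm - medv - ptratio - indus,
             data = Boston)

# Partial F-test: do the seven dropped predictors add anything jointly?
# A large p-value supports keeping the simpler six-predictor model.
anova(reduced, full)
```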
This section will delve into three different (more automated) methods for model searching: best subset selection, ridge regression, and the lasso. For all model searches, a 10-fold cross-validation (CV) method will be used in some fashion. Recall that the crim variable in the Boston data set is now log transformed, as mentioned earlier, to help alleviate some assumption issues and concerns about the crim distribution.
To perform best subset selection, we fit a separate least squares regression for each possible combination of the p predictors. That is, we fit all p models that contain exactly one predictor, all models that contain exactly two predictors, and so forth. We then look at all of the resulting models, with the goal of identifying the one that is the ‘best’ at determining the relationship between crim and the other variables.
‘Best’ in this paper will be based on using (10-fold) cross-validated prediction error. In other words, the model that minimizes the mean squared prediction error over the folds will be the ‘best’ model. A ‘best’ model will be chosen for each number of predictors up to the model with all the predictors included.
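Concretely, for each model size the 10-fold CV estimate is the average of the fold-level test errors:

\[CV_{(k)} = \frac{1}{k}\sum^k_{j=1} MSE_j, \qquad k = 10\]

where \(MSE_j\) is the mean squared prediction error on fold \(j\) for the best model of that size fit on the other \(k-1\) folds.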
# Create predict like function for regsubsets
predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])
  mat = model.matrix(form, newdata)
  coefi = coef(object, id = id)
  xvars = names(coefi)
  mat[, xvars] %*% coefi
} # predict.regsubsets
# Set up Folds for k-fold cross validation
k=10
set.seed(8)
folds=sample(1:k,nrow(Boston),replace=TRUE)
maxNumPred = 13
# Initialize error matrix
cv.errors=matrix(NA,k,maxNumPred, dimnames=list(NULL, paste(1:maxNumPred)))
# Perform k-fold cross validation and get errors
for (j in 1:k) {
  best.fit = regsubsets(crim ~ ., data = Boston[folds != j, ], nvmax = maxNumPred + 1)
  for (i in 1:maxNumPred) {
    pred = predict(best.fit, Boston[folds == j, ], id = i)
    cv.errors[j, i] = mean((Boston$crim[folds == j] - pred)^2)
  }
}
# Get mean of errors across k-folds
mean.cv.errors=apply(cv.errors,2,mean)
# Find where minimum mean error is
bestSubsetNumVars = which.min(mean.cv.errors)
BS.cv.MSE = min(mean.cv.errors)
# Get best subset coefficients using all the data (based on the number of variables found when minimizing mean.cv.error)
reg.best = regsubsets(crim~.,data=Boston, nvmax = maxNumPred+1)
BS.cv.coef = coef(reg.best,bestSubsetNumVars)
# Save off results for future use and display
modelResults = data.frame(as.list(BS.cv.coef))
modelResults$'Test MSE' = BS.cv.MSE
row.names(modelResults) = 'BestSubset'
# Move 'Test MSE' to front
modelResults = modelResults[,c(which(colnames(modelResults)=="Test MSE"),which(colnames(modelResults)!="Test MSE"))]
modelResults = t(modelResults)
In the following figure, we can see how the average mean squared error (MSE) over the different folds changes with the different number of variables in the best subset model.
# Plot mean.cv.errors
par(mfrow=c(1,1))
plot(mean.cv.errors,type="b", xlab= 'Number of Variables in Model', ylab='MSE over the folds')
abline(v=bestSubsetNumVars, lty = 2)
title('Best Subset Selection Model Errors vs Number of Variables')
From Figure 7, one can see that the best subset model with 9 variables had the lowest mean squared error (MSE) averaged over the folds during k-fold CV, with an MSE of 0.605. The table below gives the lowest test MSE along with the variables in the selected model and their coefficient values.
# display results in table
kable(modelResults, format = 'markdown')
| | BestSubset |
|---|---|
| Test MSE | 0.605 |
| X.Intercept. | -4.101 |
| zn | -0.012 |
| indus | 0.020 |
| nox | 3.867 |
| age | 0.006 |
| rad | 0.141 |
| ptratio | -0.041 |
| black | -0.001 |
| lstat | 0.034 |
| medv | 0.009 |
From Table 6, one can see the minimum test MSE was 0.605, which aligns with the location of the minimum point in Figure 7.
From reference [1], ridge regression is very similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity. In particular, the ridge regression coefficient estimates, \(\hat{\beta}^R_\lambda\), are the values that minimize:
\[\sum^n_{i=1} (y_i-\beta_0-\sum^p_{j=1}\beta_jx_{ij} )^2 +\lambda\sum^p_{j=1}\beta_j^2 = RSS + \lambda\sum^p_{j=1}\beta_j^2\] where RSS is the residual sum of squares and \(\lambda\ge0\) is a tuning parameter, determined separately with k-fold cross-validation. The ‘best’ model will have the \(\lambda\) that minimizes the mean cross-validated error across the k folds.
The data will be split into a training set and a testing set. The training set will be used to fit a model and, via 10-fold CV, to determine the ‘best’ lambda, i.e. the one with the lowest average MSE. Once that is determined, that lambda value and the model fit on the training set will be used to predict the test set and obtain a test mean squared error.
# Create our x matrix
x = model.matrix(crim~.,Boston)[,-1]
y = Boston$crim # Create our response vector (y)
set.seed(8)
# Get Training and Testing Data Set
train = sample(1:nrow(x), 3*nrow(x)/5) # wanted ~300 obs for 10-fold CV, so each fold has ~30 obs.
test = -(train)
y.test = y[test]
# Fit ridge regression on the training set
grid=10^seq(10, -2, length=100)
ridge.mod = glmnet(x[train,], y[train], alpha=0, lambda=grid, thresh=1e-12)
# Perform k-fold cross validation on all the data to determine best lambda value
ridge.cv.out = cv.glmnet(x[train,], y[train], alpha=0) # this does 10-fold cv by default, appropriate with 500 obs
ridge.bestlam=ridge.cv.out$lambda.min
# MSE associated with value of lambda that results in smallest cv-error
ridge.pred = predict(ridge.mod, s=ridge.bestlam, newx=x[test,])
ridge.MSE = mean((ridge.pred - y.test)^2)
# Now refit ridge regression on full data set, using the value of lambda
# selected by cross-validation
ridge.modF = glmnet(x, y, alpha=0, lambda=grid)
ridge.coef=predict(ridge.modF,type="coefficients",s=ridge.bestlam)[1:(maxNumPred+1),] # keep intercept plus all 13 predictors
# Save off Results
ridgeModelResults = data.frame(as.list(ridge.coef))
ridgeModelResults$'Test MSE' = ridge.MSE
ridgeModelResults$BestLambda = ridge.bestlam
row.names(ridgeModelResults) = 'RidgeRegression'
# Rearrange variables
ridgeModelResults = ridgeModelResults[,c(which(colnames(ridgeModelResults)=="BestLambda"), which(colnames(ridgeModelResults)!="BestLambda"))]
ridgeModelResults = ridgeModelResults[,c(which(colnames(ridgeModelResults)=="Test MSE"),which(colnames(ridgeModelResults)!="Test MSE"))]
ridgeModelResults = t(ridgeModelResults) # transpose for easier table display
In the following figure, one can see how the error changes as lambda is changed for the ridge regression model (that is based on the training data).
plot(ridge.cv.out)
From the figure, one can see the lowest MSE occurs at a log(lambda) value below 0, with an MSE of around 0.4. The axis across the top of the plot also indicates this model keeps all 13 variables; in fact, ridge regression retains all 13 variables for every lambda value shown. Remember, this figure is based on the model fit on the training set.
How the variable coefficient values change with lambda is given below in a visual manner.
plot_glmnet(ridge.modF)
abline(v=log(ridge.bestlam), lty = 2)
In the figure, the dashed vertical line indicates the ‘best’ lambda value, i.e. the one that results in the lowest CV error on the training set. This figure is based on all the data (both the training and testing sets). One can see that the nox coefficient has the largest magnitude in the ‘best’ model selected. This is also shown in the table below, which displays the actual coefficient values along with the best lambda and the test MSE.
# display results in table
kable(ridgeModelResults, format = 'markdown')
| | RidgeRegression |
|---|---|
| Test MSE | 0.772 |
| BestLambda | 0.192 |
| X.Intercept. | -4.046 |
| zn | -0.010 |
| indus | 0.014 |
| chas1 | 0.019 |
| nox | 3.448 |
| rm | -0.020 |
| age | 0.006 |
| dis | -0.044 |
| rad | 0.102 |
| tax | 0.002 |
| ptratio | -0.020 |
| black | -0.002 |
| lstat | 0.030 |
From Table 7, one can see the test MSE was 0.772. Table 7 also shows the coefficient values; as expected for ridge regression, no coefficients are shrunk exactly to zero. As seen in Figure 9, nox has the largest coefficient magnitude.
One downside to ridge regression is that it will usually produce a model involving all the predictors, even when some are irrelevant, which can make the model harder to interpret. The lasso is a more recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficient estimates, \(\hat{\beta}^L_\lambda\), are the values that minimize:
\[\sum^n_{i=1} (y_i-\beta_0-\sum^p_{j=1}\beta_jx_{ij} )^2 +\lambda\sum^p_{j=1}|{\beta_j}| = RSS + \lambda\sum^p_{j=1}|\beta_j|\] Similar to ridge regression, \(\lambda\ge0\) is a tuning parameter that will be determined separately with k-fold cross validation. The ‘best’ model will have a \(\lambda\) that will minimize the mean cross-validated error across the k-folds.
As with ridge regression, the data will be split into a training set and a testing set. The training set will be used to fit a model and, via 10-fold CV, to determine the ‘best’ lambda (the one with the lowest average MSE). That lambda and the model fit on the training set will then be used to predict the test set and obtain a test mean squared error.
set.seed(1)
# Fit lasso on the training set (same one used as ridge)
grid=10^seq(10, -2, length=100) # using same grid as before (optional: could use new grid)
lasso.mod = glmnet(x[train,], y[train], alpha=1, lambda=grid, thresh=1e-12)
# Create lasso model, perform k-fold CV, and get best lambda and lowest MSE
lasso.cv.out= cv.glmnet(x[train,], y[train], alpha=1)
lasso.bestlam=lasso.cv.out$lambda.min
lasso.pred = predict(lasso.mod, s=lasso.bestlam, newx=x[test,])
lasso.MSE = mean((lasso.pred - y.test)^2) # test set MSE
# Now refit lasso on full data set, using the value of lambda
# selected by cross-validation
lasso.modF = glmnet(x, y, alpha=1, lambda=grid)
lasso.coef = predict(lasso.modF, type="coefficients", s=lasso.bestlam)[1:(maxNumPred+1),] # keep intercept plus all 13 predictors
# Get coefficients and save off results
lasso.coef = data.frame(as.list(lasso.coef))
lassoModelResults = data.frame(as.list(lasso.coef))
lassoModelResults$'Test MSE' = lasso.MSE
lassoModelResults$BestLambda = lasso.bestlam
row.names(lassoModelResults) = 'Lasso'
# Rearrange variables
lassoModelResults = lassoModelResults[,c(which(colnames(lassoModelResults)=="BestLambda"), which(colnames(lassoModelResults)!="BestLambda"))]
lassoModelResults = lassoModelResults[,c(which(colnames(lassoModelResults)=="Test MSE"),which(colnames(lassoModelResults)!="Test MSE"))]
lassoModelResults = t(lassoModelResults) # transpose for easier table display
In the following figure, one can see how the error changes as lambda is changed for the lasso model (based on the training data).
plot(lasso.cv.out)
From the figure, one can see the lowest MSE occurs at a log(lambda) value near -4, with an MSE of around 0.4. The axis across the top of the plot also indicates this model keeps around 8 variables. Remember, this figure is based on the model fit on the training set.
How these variable coefficient values change with lambda (along with the number of variable coefficients) is given below in a visual manner.
plot_glmnet(lasso.modF)
abline(v=log(lasso.bestlam), lty = 2)
In the figure, the dashed vertical line again indicates the ‘best’ lambda value, the one resulting in the lowest CV error. This figure is based on all the data (both the training and testing sets). One can see that the nox coefficient has the largest magnitude in the ‘best’ model selected. This is also shown in the table below, which displays the actual coefficient values along with the best lambda and the test MSE.
# display results in table
kable(lassoModelResults, format = 'markdown')
| | Lasso |
|---|---|
| Test MSE | 0.724 |
| BestLambda | 0.020 |
| X.Intercept. | -4.040 |
| zn | -0.010 |
| indus | 0.016 |
| chas1 | 0.000 |
| nox | 3.923 |
| rm | 0.000 |
| age | 0.006 |
| dis | -0.024 |
| rad | 0.138 |
| tax | 0.000 |
| ptratio | -0.024 |
| black | -0.001 |
| lstat | 0.025 |